UCT/CUDA: Add support for flush_ep #10991

shasson5 · 2025-11-05T12:49:32Z

What?

Add support for flush_ep

Why?

Required for cuda_ipc to be used in gtests

Summary by CodeRabbit

New Features
- Added per-endpoint CUDA stream flush API that synchronizes all active streams and supports completion callbacks.
Bug Fixes
- Improved flush progress behavior to avoid treating flush events as regular events, preventing premature completions.
- Enhanced flush error handling and resource cleanup to reduce leaks and ensure reliable rollback on failure.

coderabbitai · 2025-11-05T12:49:46Z

Walkthrough

Adds a CUDA-specific endpoint flush API that enqueues per-stream flush events tied to a shared flush descriptor, introduces flush event types and a flush descriptor pool, updates progress to treat flush events specially, and adds rollback and cleanup on allocation/enqueue failures.

Changes

Cohort / File(s)	Summary
Flush Type Definitions `src/uct/cuda/base/cuda_iface.h`	Adds `uct_cuda_flush_desc_t` (stream counter + completion) and `uct_cuda_flush_stream_desc_t` (per-stream event embedding); extends `uct_cuda_iface_t` with a `flush_mpool` member and declares `uct_cuda_base_ep_flush()`.
Flush Implementation `src/uct/cuda/base/cuda_iface.c`	Implements `uct_cuda_base_ep_flush()` public API, per-stream callback `uct_cuda_base_stream_flushed_cb()` that decrements the shared flush counter and completes when zero, helper `uct_cuda_base_event_is_flush()`, introduces `flush_mpool` ops and pool lifecycle (init/cleanup), updates progress to skip treating flush events as regular completions, and adds rollback on allocation/enqueue failures.
Interface Handlers `src/uct/cuda/cuda_copy/cuda_copy_iface.c`, `src/uct/cuda/cuda_ipc/cuda_ipc_iface.c`	Replaces `ep_flush` op bindings from generic `uct_base_ep_flush` to the new `uct_cuda_base_ep_flush` for CUDA copy and CUDA IPC ifaces.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant EP as Endpoint
    participant IFACE as CUDA_Base_IFACE
    participant Stream as GPU_Stream
    participant Progress as Progress_Loop

    Caller->>EP: ep_flush(comp)
    activate EP
    EP->>IFACE: uct_cuda_base_ep_flush()
    activate IFACE
    IFACE->>IFACE: alloc flush_desc (stream_counter = N)
    loop per active stream (N)
        IFACE->>Stream: enqueue per-stream flush event (flush_stream_desc)
    end
    EP-->>Caller: return (pending)
    deactivate EP
    deactivate IFACE

    par Stream completions & progress polling
        Stream->>IFACE: stream event triggers callback
        activate IFACE
        IFACE->>IFACE: uct_cuda_base_stream_flushed_cb() — decrement counter
        alt counter == 0
            IFACE->>Caller: complete user completion, free flush_desc
        end
        deactivate IFACE
        Progress->>Progress: poll events (skip flush events)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Check flush_mpool initialization/destruction and error-path cleanup.
Verify atomic/thread-safe handling of stream_counter in callbacks.
Confirm progress loop correctly identifies and skips flush events.
Review rollback logic for partially enqueued flush_stream_desc items.

Poem

🐇 I nudged each stream to hop in line,

small flags that wait, then tick and shine;
the counter counts each tiny feat,
when last one lands, the flush is sweet —
the rabbit hums, the buffers clear.

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'UCT/CUDA: Add support for flush_ep' accurately reflects the main change: implementing flush_ep functionality for CUDA endpoints, as evidenced by the new uct_cuda_base_ep_flush API and integration across CUDA interfaces.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bbcc1d8 and e3d088d.

📒 Files selected for processing (1)

src/uct/cuda/base/cuda_iface.c (4 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/uct/cuda/base/cuda_iface.c (2)

src/ucs/debug/memtrack.c (2)

ucs_free (368-372)

ucs_malloc (328-334)

src/ucs/datastruct/mpool.c (7)

ucs_mpool_get (215-218)

ucs_mpool_put (220-223)

ucs_mpool_chunk_malloc (326-330)

ucs_mpool_chunk_free (332-335)

ucs_mpool_params_reset (60-73)

ucs_mpool_init (81-147)

ucs_mpool_cleanup (149-198)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: UCX PR (Static_check Static checks)
GitHub Check: UCX PR (Codestyle AUTHORS file update check)
GitHub Check: UCX PR (Codestyle ctags check)
GitHub Check: UCX PR (Codestyle codespell check)
GitHub Check: UCX PR (Codestyle format code)
GitHub Check: UCX PR (Codestyle commit title)
GitHub Check: UCX release DRP (Prepare CheckRelease)
GitHub Check: UCX release (Prepare CheckRelease)
GitHub Check: UCX snapshot (Prepare Check)

🔇 Additional comments (7)

src/uct/cuda/base/cuda_iface.c (7)

260-265: LGTM: Clean flush event detection.

The function pointer comparison is an effective way to identify flush events. The dual check (non-NULL comp and matching function) ensures correctness.

275-290: LGTM: Correct flush event handling in progress path.

The updated logic properly distinguishes flush events from regular events:

Flush events are extracted immediately (serving as queue markers)

Regular events are queried for completion before extraction

Flush events skip complete_event invocation (line 285-287) since they represent flush markers rather than actual CUDA operations

The flush-as-marker design is sound: when a flush event reaches the front of the queue, all preceding operations are guaranteed complete.

451-457: LGTM: Appropriate mpool configuration for flush descriptors.

Unlike regular event descriptors that require CUDA event initialization, flush descriptors are simple structures that use queue ordering for synchronization. NULL init/cleanup callbacks are correct here.

465-486: LGTM: Proper flush mpool initialization.

The constructor correctly:

Initializes mpool parameters with appropriate element size

Checks initialization status and returns errors properly

Follows the same pattern as event descriptor mpool initialization elsewhere in the codebase

502-502: LGTM: Proper flush mpool cleanup with leak checking.

The destructor correctly cleans up the flush mpool with leak checking enabled, matching the cleanup pattern for other mpools in the codebase.

209-211: No issues found after verification.

The NULL completion handling is correct and follows expected UCT semantics. When comp == NULL, the function returns UCS_INPROGRESS if there are active streams, indicating work is pending without a callback request—consistent with other async transports like UD. The mechanism is well-documented in the code comments explaining the flush event queue mechanism.

242-257: Rollback logic appears sound — no issues found.

The verification confirms the implementation is correct:

ucs_queue_iter_elem macro correctly uses container_of to retrieve queue descriptors from the iterator over the embedded queue field

Tail element assumption is valid — the rollback retrieves exactly stream_counter elements (matching successful pushes) from each queue; since push appends to tail and retrieval grabs tail, the most-recently-pushed item is correctly identified for rollback

Type cast is safe — super is the first field of uct_cuda_flush_stream_desc_t, so casting uct_cuda_event_desc_t* back to the parent type is valid

The code relies on single-threaded enqueue semantics with no concurrent progress drains during the error path, which appears consistent with the endpoint context.

UCT/CUDA: Add support for flush_ep

0467037

shasson5 requested review from amastbaum, gleon99 and yosefe November 5, 2025 12:51

UCT/CUDA: minor fix

bbcc1d8

This comment was marked as spam.

Sign in to view

openucx deleted a comment from coderabbitai bot Nov 5, 2025

UCT/CUDA: minor fix

e3d088d

coderabbitai bot reviewed Nov 5, 2025

View reviewed changes

openucx deleted a comment from coderabbitai bot Nov 6, 2025

amastbaum approved these changes Nov 11, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UCT/CUDA: Add support for flush_ep #10991

UCT/CUDA: Add support for flush_ep #10991

Uh oh!

shasson5 commented Nov 5, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 5, 2025 •

edited

Loading

Uh oh!

This comment was marked as spam.

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

UCT/CUDA: Add support for flush_ep #10991

Are you sure you want to change the base?

UCT/CUDA: Add support for flush_ep #10991

Uh oh!

Conversation

shasson5 commented Nov 5, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What?

Why?

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

This comment was marked as spam.

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shasson5 commented Nov 5, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 5, 2025 •

edited

Loading